import emoji
Attrition affects every business and carries significant costs, including business disruption and the hiring and training of new staff.
Businesses, and their HR departments in particular, therefore have a strong interest in understanding what drives staff attrition and in minimizing it. A classification model that predicts whether an employee is likely to quit could greatly improve HR’s ability to intervene in time and prevent the departure.
Staff attrition refers to the loss of employees through a natural process, such as retirement, resignation, elimination of a position, personal health, or other similar reasons. With attrition, an employer will not fill the vacancy left by the former employee.
The main data source is the employee-attrition.csv file, which contains 1470 HR records. Given the limited size of the data set, the model should be expected to provide only a modest improvement in identifying attrition over a random allocation of attrition probabilities.
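To make the "random allocation" baseline concrete, the sketch below scores purely random probabilities against synthetic labels. It is only an illustration under stated assumptions: ROC AUC is my choice of metric (the text does not name one), the 20% attrition rate is a placeholder rather than a figure from the data, and `roc_auc` is a hypothetical helper that ignores tied scores.

```python
import numpy as np

def roc_auc(labels, scores):
    """Rank-based ROC AUC (Mann-Whitney U); assumes no tied scores."""
    labels = np.asarray(labels)
    scores = np.asarray(scores)
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)  # 1 = lowest score
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    rank_sum_pos = ranks[labels == 1].sum()
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

rng = np.random.default_rng(0)
n_rows = 1470                      # same size as the data set
assumed_attrition_rate = 0.2       # placeholder, NOT taken from the data
labels = (rng.random(n_rows) < assumed_attrition_rate).astype(int)
random_scores = rng.random(n_rows)  # "random allocation" of probabilities

baseline_auc = roc_auc(labels, random_scores)
print(f"AUC of random allocation: {baseline_auc:.3f}")
```

A random allocation lands close to 0.5 on this metric, so any trained model would need to score clearly above that to count as an improvement.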
| Name | Description |
|---|---|
| ✔️ AGE | Numerical Value |
| 🆕 age55 | Whether the worker is over 54 years of age |
| ⭐ ATTRITION | Employee leaving the company (0=no, 1=yes) |
| 🔶 BUSINESS TRAVEL | (1=No Travel, 2=Travel Frequently, 3=Travel Rarely) |
| 🆕 business_travel_cont | Ordered traveling variable |
| ✔️ DAILY RATE | Numerical Value - Salary Level |
| 🔶 DEPARTMENT | (1=HR, 2=R&D, 3=Sales) |
| 🆕 Department_gr2 | HR and Sales vs R&D |
| ✔️ DISTANCE FROM HOME | Numerical Value - THE DISTANCE FROM WORK TO HOME |
| 🔶 EDUCATION | Numerical Value |
| 🆕 education5 | Dichotomous variable: 5 vs others |
| 🔶 EDUCATION FIELD | (1=HR, 2=LIFE SCIENCES, 3=MARKETING, 4=MEDICAL SCIENCES, 5=OTHERS, 6=TECHNICAL DEGREE) |
| 🆕 EducationField_gr2 | 0 (Life sciences, medical sciences, others) vs 1 (MARKETING, HR, TECHNICAL) |
| ❌ EMPLOYEE COUNT | Numerical Value |
| ❌ EMPLOYEE NUMBER | Numerical Value - EMPLOYEE ID |
| ✔️ ENVIRONMENT SATISFACTION | Numerical Value - SATISFACTION WITH THE ENVIRONMENT |
| ✔️ GENDER | (1=FEMALE, 2=MALE) |
| ✔️ HOURLY RATE | Numerical Value - HOURLY SALARY |
| ✔️ JOB INVOLVEMENT | Numerical Value - JOB INVOLVEMENT |
| ✔️ JOB LEVEL | Numerical Value - LEVEL OF JOB |
| 🔶 JOB ROLE | (1=HC REP, 2=HR, 3=LAB TECHNICIAN, 4=MANAGER, 5=MANUFACTURING DIRECTOR, 6=RESEARCH DIRECTOR, 7=RESEARCH SCIENTIST, 8=SALES EXECUTIVE, 9=SALES REPRESENTATIVE) |
| 🆕 JobRole_gr2 | 0 ( "Manufacturing Director","Healthcare Representative", "Manager", "Research Director") vs 1 ("Human Resources","Laboratory Technician", "Sales Executive", "Sales Representative","Research Scientist") |
| ✔️ JOB SATISFACTION | Numerical Value - SATISFACTION WITH THE JOB |
| 🔶 MARITAL STATUS | (1=DIVORCED, 2=MARRIED, 3=SINGLE) |
| ✔️ MONTHLY INCOME | Numerical Value - MONTHLY SALARY |
| ✔️ MONTHLY RATE | Numerical Value - MONTHLY RATE |
| 🔶 NUMCOMPANIES WORKED | Numerical Value - NO. OF COMPANIES WORKED AT |
| 🆕 NumCompaniesWorked_gr2 | Number of companies worked at > 4 |
| ❌ OVER 18 | (1=YES, 2=NO) |
| ✔️ OVERTIME | (1=NO, 2=YES) |
| ✔️ PERCENT SALARY HIKE | Numerical Value - PERCENTAGE INCREASE IN SALARY |
| ❌ PERFORMANCE RATING | Numerical Value - PERFORMANCE RATING |
| 🔶 RELATIONS SATISFACTION | Numerical Value - RELATIONS SATISFACTION |
| ❌ STANDARD HOURS | Numerical Value - STANDARD HOURS |
| 🔶 STOCK OPTIONS LEVEL | Numerical Value - STOCK OPTIONS |
| 🔶 TOTAL WORKING YEARS | Numerical Value - TOTAL YEARS WORKED |
| ✔️ TRAINING TIMES LAST YEAR | Numerical Value - HOURS SPENT TRAINING |
| ✔️ WORK LIFE BALANCE | Numerical Value - TIME SPENT BETWEEN WORK AND OUTSIDE |
| ✔️ YEARS AT COMPANY | Numerical Value - TOTAL NUMBER OF YEARS AT THE COMPANY |
| ✔️ YEARS IN CURRENT ROLE | Numerical Value - YEARS IN CURRENT ROLE |
| ✔️ YEARS SINCE LAST PROMOTION | Numerical Value - LAST PROMOTION |
| ✔️ YEARS WITH CURRENT MANAGER | Numerical Value - YEARS SPENT WITH CURRENT MANAGER |
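The derived (🆕) variables in the table could be built along the following lines. This is a minimal sketch, not the code used for this analysis. The raw column names (Age, BusinessTravel, Department, Education, NumCompaniesWorked) and the category labels are assumptions, as are the travel ordering and the choice of which group is coded 1 in Department_gr2. The toy rows are invented for illustration.

```python
import pandas as pd

# Invented toy rows; column names and category labels are assumptions.
toy = pd.DataFrame({
    "Age": [29, 55, 41, 60],
    "BusinessTravel": ["Non-Travel", "Travel_Rarely",
                       "Travel_Frequently", "Travel_Rarely"],
    "Department": ["Sales", "Research & Development",
                   "Human Resources", "Research & Development"],
    "Education": [3, 5, 1, 5],
    "NumCompaniesWorked": [1, 5, 4, 7],
})

def add_derived_features(data: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `data` with the dichotomous/ordered variables added."""
    out = data.copy()
    # age55: worker is over 54 years of age
    out["age55"] = (out["Age"] > 54).astype(int)
    # business_travel_cont: ordered travel intensity (ordering is assumed)
    travel_order = {"Non-Travel": 0, "Travel_Rarely": 1, "Travel_Frequently": 2}
    out["business_travel_cont"] = out["BusinessTravel"].map(travel_order)
    # Department_gr2: HR and Sales (coded 1 here, an assumption) vs R&D (0)
    out["Department_gr2"] = (
        out["Department"].isin(["Human Resources", "Sales"]).astype(int)
    )
    # education5: education level 5 vs all other levels
    out["education5"] = (out["Education"] == 5).astype(int)
    # NumCompaniesWorked_gr2: worked at more than 4 companies
    out["NumCompaniesWorked_gr2"] = (out["NumCompaniesWorked"] > 4).astype(int)
    return out

derived = add_derived_features(toy)
print(derived.filter(regex="age55|_cont|_gr2|education5"))
```

Working on a copy keeps the raw columns untouched, so the original and derived versions of a variable can be compared side by side.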
I have used Python 3.9.4 in Visual Studio Code with the following packages:
I have used a wonderful package called pandas_profiling to create an HTML profile of the data. We can check distributions and other information such as missing data, correlations between variables and more.
# Load the HR data and preview the first rows
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport
df = pd.read_csv('../Data/employee-attrition.csv')
df.head()
# Build the profile report, save it as HTML and show it inline in the notebook
profile = ProfileReport(df, title="Crayon Attrition Data")
profile.to_file("../Results/EDA_001_attrition.html")
profile.to_notebook_iframe()
# Build a second EDA report with sweetviz, using Attrition as the target feature
import sweetviz as sv
my_report = sv.analyze(df, target_feat="Attrition")
my_report.show_html(filepath='../Results/EDA_002_attrition.html')
my_report.show_notebook()
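The checks mentioned above (missing data, correlations, columns that carry no information) can also be done directly in pandas as a quick cross-check of the reports. The sketch below runs on an invented toy frame; the column names are assumptions, and the values are not taken from employee-attrition.csv.

```python
import pandas as pd

# Invented toy rows with one missing value and one constant column.
toy = pd.DataFrame({
    "Age": [29, 55, 41, 60],
    "MonthlyIncome": [2500.0, 9100.0, None, 12000.0],
    "StandardHours": [80, 80, 80, 80],
    "Attrition": [1, 0, 0, 0],
})

# Missing values per column
missing_per_column = toy.isna().sum()

# Columns with a single distinct value cannot help a classifier
constant_columns = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]

# Pairwise Pearson correlations between the remaining numeric columns
numeric = toy.select_dtypes("number").drop(columns=constant_columns)
correlations = numeric.corr()

print(missing_per_column)
print("Constant columns:", constant_columns)
print(correlations.round(2))
```

Constant columns are dropped before computing correlations because their correlation with anything is undefined.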